Proceedings of DMKD '03: 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery

Authors

  • Mohammed J. Zaki
  • Charu C. Aggarwal
Abstract

Continuous data streams arise naturally, for example, in the installations of large telecom and Internet service providers where detailed usage information (Call-Detail-Records, SNMP/RMON packet-flow data, etc.) from different parts of the underlying network needs to be continuously collected and analyzed for interesting trends. Such environments raise a critical need for effective stream-processing algorithms that can provide (typically, approximate) answers to data-analysis queries while utilizing only small space (to maintain concise stream synopses) and small processing time per stream item. In this talk, I will discuss the basic pseudo-random sketching mechanism for building stream synopses and our ongoing work that exploits sketch synopses to build an approximate SQL (multi) query processor. I will also describe our recent results on extending sketching to handle more complex forms of queries and streaming data (e.g., similarity joins over streams of XML trees), and try to identify some challenging open problems in the data-streaming area.

DMKD03: 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 2003 page 1

A Symbolic Representation of Time Series, with Implications for Streaming Algorithms

Jessica Lin, Eamonn Keogh, Stefano Lonardi, Bill Chiu
University of California Riverside, Computer Science & Engineering Department, Riverside, CA 92521, USA
{jessica, eamonn, stelo, bill}@cs.ucr.edu

ABSTRACT

The parallel explosions of interest in streaming data, and data mining of time series have had surprisingly little intersection. This is in spite of the fact that time series data are typically streaming data. The main reason for this apparent paradox is the fact that the vast majority of work on streaming data explicitly assumes that the data is discrete, whereas the vast majority of time series data is real-valued.

Many researchers have also considered transforming real-valued time series into symbolic representations, noting that such representations would potentially allow researchers to avail of the wealth of data structures and algorithms from the text processing and bioinformatics communities, in addition to allowing formerly "batch-only" problems to be tackled by the streaming community. While many symbolic representations of time series have been introduced over the past decades, they all suffer from three fatal flaws. Firstly, the dimensionality of the symbolic representation is the same as the original data, and virtually all data mining algorithms scale poorly with dimensionality. Secondly, although distance measures can be defined on the symbolic approaches, these distance measures have little correlation with distance measures defined on the original time series. Finally, most of these symbolic approaches require one to have access to all the data before creating the symbolic representation. This last feature explicitly thwarts efforts to use the representations with streaming algorithms.

In this work we introduce a new symbolic representation of time series. Our representation is unique in that it allows dimensionality/numerosity reduction, and it also allows distance measures to be defined on the symbolic approach that lower bound corresponding distance measures defined on the original series. As we shall demonstrate, this latter feature is particularly exciting because it allows one to run certain data mining algorithms on the efficiently manipulated symbolic representation, while producing identical results to the algorithms that operate on the original data. Finally, our representation allows the real-valued data to be converted in a streaming fashion, with only an infinitesimal time and space overhead. We will demonstrate the utility of our representation on the classic data mining tasks of clustering, classification, query by content and anomaly detection.
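
The abstract names two properties, dimensionality reduction and a symbolic distance that lower-bounds the distance on the original series, without giving the construction. The Python sketch below illustrates one way such a representation can be built. It assumes the pipeline commonly described for this kind of representation: z-normalize a window, average it over equal-width frames (piecewise aggregate approximation), and map each average to a symbol using breakpoints that split a standard normal distribution into equiprobable regions. All function names, the integer symbol encoding, and the parameter choices are illustrative assumptions, not taken from the abstract.

```python
import bisect
import math
from statistics import NormalDist


def znormalize(series):
    """Rescale a window to zero mean and unit variance."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    if std == 0.0:
        return [0.0] * n  # constant window: map every point to zero
    return [(x - mean) / std for x in series]


def paa(series, segments):
    """Piecewise aggregate approximation: the mean of each equal-width frame."""
    if len(series) % segments != 0:
        raise ValueError("series length must be a multiple of segments")
    width = len(series) // segments
    return [sum(series[i * width:(i + 1) * width]) / width
            for i in range(segments)]


def breakpoints(alphabet_size):
    """Cut points splitting a standard normal into equiprobable regions."""
    normal = NormalDist()
    return [normal.inv_cdf(k / alphabet_size) for k in range(1, alphabet_size)]


def symbolize(series, segments, alphabet_size):
    """Map a real-valued window to `segments` integer symbols (0 = lowest)."""
    cuts = breakpoints(alphabet_size)
    return [bisect.bisect_right(cuts, v)
            for v in paa(znormalize(series), segments)]


def symbolic_distance(a, b, original_length, alphabet_size):
    """Distance on symbols that never exceeds the Euclidean distance
    between the two z-normalized original windows."""
    cuts = breakpoints(alphabet_size)
    total = 0.0
    for r, c in zip(a, b):
        lo, hi = min(r, c), max(r, c)
        if hi - lo > 1:  # identical or adjacent symbols contribute zero
            total += (cuts[hi - 1] - cuts[lo]) ** 2
    return math.sqrt(original_length / len(a)) * math.sqrt(total)


def euclidean(x, y):
    """Plain Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))


if __name__ == "__main__":
    x = [math.sin(i / 3.0) for i in range(32)]
    y = [math.cos(i / 5.0) + 0.1 * i for i in range(32)]
    sx, sy = symbolize(x, 8, 4), symbolize(y, 8, 4)
    print(sx, sy)
    print(symbolic_distance(sx, sy, 32, 4),
          euclidean(znormalize(x), znormalize(y)))
```

In this sketch a window of 32 real values shrinks to 8 symbols from a 4-letter alphabet. In a streaming setting the same steps would presumably be applied to each sliding window as it fills, so only the current window needs to be held in memory.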

Similar resources

User-Defined Aggregates for Datamining

User-defined aggregates can be the linchpin of sophisticated datamining functions and other advanced database applications. This is demonstrated by our efficient implementation on DB2 of SQL3 user-defined aggregates extended with early returns, which we have used to implement several data mining algorithms. Aggregates with early returns are monotonic and can thus be used freely in recursive queries.

Subset Scanning for Event and Pattern Detection

group on knowledge discovery and data mining, Boston, pp 71–80 Hand DJ, Mannila H, Smyth P (2001) Principles of data mining. MIT, Cambridge Hershberger J, Shrivastava N, Suri S (2006) Cluster hulls: a technique for summarizing spatial data streams. In: Proceedings of IEEE international conference on data engineering, Atlanta, p 138 Hulten G, Spencer L, Domingos P (2001) Mining time-changing data...

Database System Extensions for Decision Support: the AXL Approach

Research on database-centric data mining is seeking to improve the effectiveness of database systems in decision support applications. Different solutions are now used for different problems, including (i) SQL extensions for more complex OLAP queries, (ii) new datablades for special data types such as time-series, and (iii) architectural extensions to support data mining functions. Here, we propos...

Publication date: 2009